Abstract

Convolutional neural networks (CNNs) have recently shown
outstanding image classification performance in the large- 
scale visual recognition challenge (ILSVRC2012). The suc- 
cess of CNNs is attributed to their ability to learn rich mid- 
level image representations as opposed to hand-designed 
low-level features used in other image classification meth- 
ods. Learning CNNs, however, amounts to estimating mil- 
lions of parameters and requires a very large number of 
annotated image samples. This property currently prevents 
application of CNNs to problems with limited training data. 

In this work we show how image representations learned 
with CNNs on large-scale annotated datasets can be efficiently
transferred to other visual recognition tasks with a
limited amount of training data. We design a method to
reuse layers trained on the ImageNet dataset to compute 
mid-level image representation for images in the PASCAL 
VOC dataset. We show that despite differences in image 
statistics and tasks in the two datasets, the transferred rep- 
resentation leads to significantly improved results for object 
and action classification, outperforming the current state of 
the art on Pascal VOC 2007 and 2012 datasets. We also 
show promising results for object and action localization. 

1. Introduction 

Object recognition has been a driving motivation for re- 
search in computer vision for many years. Recent progress 
in the field has allowed recognition to scale up from a few 
object instances in controlled setups towards hundreds of 
object categories in arbitrary environments. Much of this 
progress has been enabled by the development of robust 
image descriptors such as SIFT [ ] and HOG [ ], bag- 
of-features image representations [ , , , ] as well 
as deformable part models [ ]. Another enabling factor
has been the development of increasingly large and realistic
image datasets providing object annotation for training
and testing, such as Caltech256 [ ], Pascal VOC [ ] and
ImageNet [9].

Although less common in recent years, neural networks have
a long history in visual recognition. Rosenblatt’s Mark I
Perceptron [ ] arguably was one of the first computer vision
systems. Inspired by the neural connectivity pattern
discovered by Hubel and Wiesel [ ], Fukushima’s
Neocognitron [ ] extended earlier networks with invariance
to image translations. Combining the back-propagation
algorithm [ ] with the Neocognitron architecture,
convolutional neural networks [ , 9] quickly achieved
excellent results in optical character recognition, leading
to large-scale industrial applications [ , ].

Figure 1: Recognition and localization results of our method for
a Pascal VOC test image. Output maps are shown for six object
categories with the highest responses.

Convolutional neural networks (CNNs) are high-capacity
classifiers with very large numbers of parameters that must
be learned from training examples. While CNNs have been
advocated beyond character recognition for other vision
tasks [ , ] including generic object recognition [ ],
their performance was limited by the relatively small sizes
of standard object recognition datasets.

Notably, many successful image classification pipelines
share aspects of the Neocognitron and convolutional neural
networks. Quantizing and spatially aggregating local
descriptors [ , , ] arguably produces low-level image
features comparable to those computed by the first two layers
of the Neocognitron. It is therefore possible that these
manually designed pipelines only outperformed earlier CNNs
because CNNs are hard to train using small datasets.

This situation has changed with the appearance of the
large-scale ImageNet dataset [ ] and the rise of GPU
computing. Krizhevsky et al. [ ] achieve a performance leap
in image classification on the ImageNet 2012 Large-Scale
Visual Recognition Challenge (ILSVRC-2012), and further
improve the performance by training a network on all 15
million images and 22,000 ImageNet classes. As much as
this result is promising and exciting, it is also worrisome.
Will we need to collect millions of annotated images for
each new visual recognition task in the future?

It has been argued that computer vision datasets have
significant differences in image statistics [ ]. For
example, while objects are typically centered in Caltech256
and ImageNet datasets, other datasets such as Pascal VOC
and LabelMe are more likely to contain objects embedded
in a scene (see Figure 3). Differences in viewpoints,
scene context, “background” (negative class) and other
factors inevitably affect recognition performance when
training and testing across different domains [ , , ].
Similar phenomena have been observed in other areas such as
NLP [ ]. Given the “data-hungry” nature of CNNs and the
difficulty of collecting large-scale image datasets, the
applicability of CNNs to tasks with a limited amount of
training data appears to be an important open problem.

To address this problem, we propose to transfer image
representations learned with CNNs on large datasets to
other visual recognition tasks with limited training data. In
particular, we design a method that uses ImageNet-trained
layers of a CNN to compute an efficient mid-level image
representation for images in Pascal VOC. We analyze the
transfer performance and show significant improvements on
the Pascal VOC object and action classification tasks,
outperforming the state of the art. We also show promising
results for object and action localization. Results of object
recognition and localization by our method are illustrated in
Figure 1.

In the following we discuss related work in Section 2. 
Sections 3 and 4 present our method and experiments, re- 
spectively. 

2. Related Work 

Our method is related to numerous works on transfer 
learning, image classification, and deep learning, which we 
briefly discuss below. 


Transfer learning. Transfer learning aims to transfer
knowledge between related source and target domains [ ].
In computer vision, examples of transfer learning
include [ , ], which try to overcome the deficit of training
samples for some categories by adapting classifiers trained
for other categories. Other methods aim to cope with
different data distributions in the source and target domains
for the same categories, e.g. due to lighting, background and
view-point variations [ , , ]. These and other related
methods adapt classifiers or kernels while using standard
image features. In contrast to these works, we transfer
image representations trained on the source task.

More similar to our work, [ ] trains CNNs on unsupervised
pseudo-tasks. Unlike [ ], we pre-train the convolutional
layers of CNNs on a large-scale supervised task
and address variations in scale and position of objects in
the image. Transfer learning with CNNs has also been
explored for Natural Language Processing [3] in a manner
closely related to our approach. Other recent efforts, carried
out in parallel with our work, also propose transferring image
representations learnt from the large-scale fully-labelled
ImageNet dataset using the convolutional neural network
architecture of [ ]. However, they investigate transfer to
other visual recognition tasks such as Caltech256 image
classification [ ], scene classification [ ] and object
localization [17, 42].

Visual object classification. Most of the recent image
classification methods follow the bag-of-features
pipeline [ ]. Densely-sampled SIFT descriptors [ ] are
typically quantized using unsupervised clustering (k-means,
GMM). Histogram encoding [ , ], spatial pooling [ ]
and more recent Fisher Vector encoding [ ] are common
methods for feature aggregation. While such representations
have been shown to work well in practice, it is unclear
whether they are optimal for the task. This question
raised considerable interest in the subject of mid-level
features [ , , ], and feature learning in general [ , ,47].
The goal of this work is to show that convolutional network
layers provide generic mid-level image representations that
can be transferred to new tasks.

Deep Learning. The recent revival of interest in multi- 
layer neural networks was triggered by a growing number of 
works on learning intermediate representations, either using 
unsupervised methods, as in [ , ], or using more tradi- 
tional supervised techniques, as in [ , ]. 

3. Transferring CNN weights 

The CNN architecture of [ ] contains more than 60 million
parameters. Directly learning so many parameters from
only a few thousand training images is problematic. The
key idea of this work is that the internal layers of the CNN
can act as a generic extractor of mid-level image
representation, which can be pre-trained on one dataset (the
source task, here ImageNet) and then re-used on other target
tasks (here object and action classification in Pascal VOC),
as illustrated in Figure 2. However, this is difficult as the
labels and the distribution of images (type of objects, typical
viewpoints, imaging conditions, etc.) in the source and
target datasets can be very different, as illustrated in Figure 3.
To address these challenges we (i) design an architecture
that explicitly remaps the class labels between the source
and target tasks (Section 3.1), and (ii) develop training and
test procedures, inspired by sliding window detectors, that
explicitly deal with different distributions of object sizes,
locations and scene clutter in source and target tasks
(Sections 3.2 and 3.3).

Figure 2: Transferring parameters of a CNN. First, the network is trained on the source task (ImageNet classification, top row) with
a large amount of available labelled images. Pre-trained parameters of the internal layers of the network (C1-FC7) are then transferred to
the target tasks (Pascal VOC object or action classification, bottom row). To compensate for the different image statistics (type of objects,
typical viewpoints, imaging conditions) of the source and target data we add an adaptation layer (fully connected layers FCa and FCb) and
train them on the labelled data of the target task.


For target tasks (Pascal VOC object and action
classification) we wish to design a network that will output
scores for target categories, or background if none of the
categories are present in the image. However, the object
labels in the source task can be very different from the labels
in the target task (also called a “label bias” [ ]). For
example, the source network is trained to recognize different
breeds of dogs such as husky dog or australian terrier, but the
target task contains only one label dog. The problem becomes
even more evident for the target task of action classification.
What object categories in ImageNet are related to
the target actions reading or running?


3.1. Network architecture 


For the source task, we use the network architecture
of Krizhevsky et al. [ ]. The network takes as
input a square 224 × 224 pixel RGB image and produces
a distribution over the ImageNet object classes.
This network is composed of five successive convolutional
layers C1...C5 followed by three fully connected
layers FC6...FC8 (Figure 2, top). Please refer to [ ]
for the description of the geometry of the five convolutional
layers and their setup regarding contrast normalization
and pooling. The three fully connected layers then
compute $Y_6 = \sigma(W_6 Y_5 + B_6)$, $Y_7 = \sigma(W_7 Y_6 + B_7)$,
and $Y_8 = \psi(W_8 Y_7 + B_8)$, where $Y_k$ denotes the
output of the $k$-th layer, $W_k$, $B_k$ are the trainable
parameters of the $k$-th layer, and $\sigma(X)[i] = \max(0, X[i])$ and
$\psi(X)[i] = e^{X[i]} / \sum_j e^{X[j]}$ are the “ReLU” and “SoftMax”
non-linear activation functions.


In order to achieve the transfer, we remove the output
layer FC8 of the pre-trained network and add an adaptation
layer formed by two fully connected layers FCa and FCb
(see Figure 2, bottom) that use the output vector $Y_7$ of the
layer FC7 as input. Note that $Y_7$ is obtained as a complex
non-linear function of potentially all input pixels and may
capture mid-level object parts as well as their high-level
configurations [ , ]. The FCa and FCb layers compute
$Y_a = \sigma(W_a Y_7 + B_a)$ and $Y_b = \psi(W_b Y_a + B_b)$, where
$W_a$, $B_a$, $W_b$, $B_b$ are the trainable parameters and
$\sigma$, $\psi$ are the ReLU and SoftMax functions defined above.
In all our experiments, FC6 and FC7 have equal sizes (either
4096 or 6144, see Section 4), FCa has size 2048, and FCb has a
size equal to the number of target categories.

The parameters of layers C1...C5, FC6 and FC7 are first
trained on the source task, then transferred to the target task
and kept fixed. Only the adaptation layer is trained on the
target task training data as described next.
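
As a rough illustration of the adaptation layer, the NumPy sketch below evaluates the FCa and FCb computations for a single FC7 output vector. The layer sizes follow the text (4096 to 2048 to 21, where 21 is the category count quoted in Section 4). The function names, the random weight initialization and the random stand-in for the FC7 activation are assumptions made only for this example, and no training step is shown.

```python
import numpy as np

def relu(x):
    # sigma(X)[i] = max(0, X[i])
    return np.maximum(0.0, x)

def softmax(x):
    # psi(X)[i] = exp(X[i]) / sum_j exp(X[j]); subtracting max(x) avoids overflow
    e = np.exp(x - np.max(x))
    return e / e.sum()

def adaptation_forward(y7, w_a, b_a, w_b, b_b):
    """Map a (frozen) FC7 output y7 to target-class probabilities."""
    y_a = relu(w_a @ y7 + b_a)       # FCa
    return softmax(w_b @ y_a + b_b)  # FCb

# Hypothetical sizes and random parameters, for illustration only.
rng = np.random.default_rng(0)
d7, d_a, n_classes = 4096, 2048, 21
w_a = rng.normal(0.0, 0.01, size=(d_a, d7))
b_a = np.zeros(d_a)
w_b = rng.normal(0.0, 0.01, size=(n_classes, d_a))
b_b = np.zeros(n_classes)

y7 = relu(rng.normal(size=d7))  # stand-in for a real FC7 activation
y_b = adaptation_forward(y7, w_a, b_a, w_b, b_b)
print(y_b.shape, float(y_b.sum()))
```

Only `w_a`, `b_a`, `w_b` and `b_b` would receive gradient updates in this setup; everything that produces `y7` stays fixed.
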
Figure 3: Illustration of different dataset statistics between the
source (ImageNet) and target (Pascal VOC) tasks. Pascal VOC
data displays objects embedded in complex scenes, at various
scales (right), and in complex mutual configurations (middle).
Left: Image from ImageNet with label maltese terrier.
Middle and right: Images from Pascal VOC with label dog.


3.2. Network training 

First, we pre-train the network using the code of [ ] on
the ImageNet classification source task. Each image typically
contains one object, centered and occupying a significant
portion of the image with limited background clutter, as
illustrated in Figure 3 (left). The network is trained to predict
the ImageNet object class label given the entire image as
input. Details are given in Section 4.

As discussed above, the network is pre-trained to classify
source task images that depict single centered objects.
The images in the target task, however, often depict complex
scenes with multiple objects at different scales and
orientations with a significant amount of background clutter, as
illustrated in Figure 3 (middle and right). In other words,
the distribution of object orientations and sizes as well as,
for example, their mutual occlusion patterns is very different
between the two tasks. This issue has also been called
“a dataset capture bias” [ ]. In addition, the target task
may contain many other objects in the background that are
not present in the source task training data (a “negative data
bias” [ ]). To explicitly address these issues we train the
adaptation layer using a procedure inspired by training
sliding window object detectors (e.g. [ ]) described next.

We employ a sliding window strategy and extract around
500 square patches from each image by sampling eight
different scales on a regularly-spaced grid with at least 50%
overlap between neighboring patches. More precisely, we
use square patches of width $s = \min(w, h)/\lambda$ pixels, where
$w$ and $h$ are the width and height of the image, respectively,
and $\lambda \in \{1, 1.3, 1.6, 2, 2.4, 2.8, 3.2, 3.6, 4\}$. Each patch is
rescaled to 224 × 224 pixels to form a valid input for the network.
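
The multi-scale sampling can be sketched as follows. The patch width and the scale set come from the text; the grid construction (evenly spaced positions whose spacing never exceeds half a patch width) is one assumed way of guaranteeing the 50% overlap, and the helper names are hypothetical. Because the exact grid used in the experiments is not specified, the number of patches this sketch produces need not match the figure of about 500 quoted above.

```python
import math
import numpy as np

# Scale divisors lambda as listed in the text.
SCALES = (1, 1.3, 1.6, 2, 2.4, 2.8, 3.2, 3.6, 4)

def grid_positions(length, size):
    """Evenly spaced patch origins along one axis with stride <= size / 2."""
    n = int(math.ceil((length - size) / (size / 2.0))) + 1
    return np.linspace(0.0, length - size, n)

def sample_patches(w, h, scales=SCALES):
    """Return (x, y, s) triples: top-left corner and width of each square patch."""
    patches = []
    for lam in scales:
        s = min(w, h) / lam  # patch width for this scale
        for y in grid_positions(h, s):
            for x in grid_positions(w, s):
                patches.append((float(x), float(y), s))
    return patches

patches = sample_patches(500, 375)
print(len(patches), patches[0])
```

Each `(x, y, s)` patch would then be cropped and resized to the network input resolution before being passed through the network.
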

Sampled image patches may contain one or more objects,
background, or only a part of the object. To label
patches in training images, we measure the overlap between
the bounding box of a patch $P$ and ground truth bounding
boxes $B$ of annotated objects in the image. The patch is
labelled as a positive training example for class $o$ if there
exists a $B_o$ corresponding to class $o$ such that (i) $B_o$ overlaps
sufficiently with the patch, $|P \cap B_o| > 0.2|P|$, (ii) the patch
contains a large portion of the object, $|P \cap B_o| > 0.6|B_o|$,
and (iii) the patch overlaps with no more than one object.
In the above definitions $|A|$ measures the area of the
bounding box $A$. Our labeling criteria are illustrated in Figure 4.

Figure 4: Generating training data for the target task. The
input image (top) is divided into multi-scale overlapping patches
(bottom). Each patch is labelled with an object label (green) or
as background (red) depending on the overlap with object
bounding boxes. Note that object patches are similar in appearance to
the training data for the source task containing mostly centered
objects.
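
A minimal sketch of criteria (i) to (iii) is given below, with boxes written as `(x0, y0, x1, y1)`. The thresholds are those stated in the text. Two choices are assumptions of this sketch rather than statements from the paper: a patch that touches no object is returned as `"background"`, and any patch that overlaps an object without meeting all three criteria is returned as `None` (left unlabelled).

```python
def area(box):
    """Area of an axis-aligned box (x0, y0, x1, y1); empty boxes have area 0."""
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)

def intersection(a, b):
    """Area of the overlap between boxes a and b."""
    return area((max(a[0], b[0]), max(a[1], b[1]),
                 min(a[2], b[2]), min(a[3], b[3])))

def label_patch(patch, objects, min_patch_frac=0.2, min_object_frac=0.6):
    """objects is a list of (class_name, box) ground-truth annotations."""
    hits = [(name, box, intersection(patch, box)) for name, box in objects]
    hits = [h for h in hits if h[2] > 0]
    if not hits:
        return "background"   # assumption: no overlap at all means background
    if len(hits) > 1:
        return None           # criterion (iii) violated
    name, box, inter = hits[0]
    covers_patch = inter > min_patch_frac * area(patch)    # criterion (i)
    covers_object = inter > min_object_frac * area(box)    # criterion (ii)
    return name if (covers_patch and covers_object) else None

print(label_patch((0, 0, 10, 10), [("dog", (2, 2, 8, 8))]))
```
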

Dealing with background. As discussed above, the target
task has an additional background label for patches
that do not contain any object. One additional difficulty
is that the training data is unbalanced: most patches from
training images come from background. This can be addressed
by re-weighting the training cost function, which
would amount to re-weighting its gradients during training.
We opt for a slightly different procedure and instead
resample the training patches to balance the training data
distribution. This resampled training set is then used to
form mini-batches for the stochastic gradient descent
training. This is implemented by sampling a random 10% of the
training background patches.
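
The resampling step can be illustrated with a few lines of Python. The 10% figure is from the text; the `(patch, label)` tuple layout, the function name and the fixed seed are assumptions of this sketch.

```python
import random

def balance_patches(labelled_patches, background_fraction=0.1, seed=0):
    """Keep every object patch and a random subset of the background patches.

    labelled_patches is a list of (patch, label) pairs, where label is either
    a class name or the string "background".
    """
    rng = random.Random(seed)
    objects = [p for p in labelled_patches if p[1] != "background"]
    background = [p for p in labelled_patches if p[1] == "background"]
    n_keep = int(round(background_fraction * len(background)))
    balanced = objects + rng.sample(background, n_keep)
    rng.shuffle(balanced)  # mix classes before forming mini-batches
    return balanced

data = [(i, "background") for i in range(100)] + [(i, "dog") for i in range(5)]
print(len(balance_patches(data)))
```
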

3.3. Classification 

At test time we apply the network to each of the
(approximately) 500 overlapping multi-scale patches extracted
from the test image. Examples of patch scores visualized
over entire images are shown in Figures 1 and 5. We use
the following aggregation formula to compute the overall
score for object $C_n$ in the image

$$\mathrm{score}(C_n) = \frac{1}{M} \sum_{i=1}^{M} y(C_n \mid P_i)^k, \qquad (1)$$

where $y(C_n \mid P_i)$ is the output of the network for class $C_n$
on image patch $P_i$, $M$ is the number of patches in the
image, and $k > 1$ is a parameter. Higher values of $k$ focus on
the highest scoring patches and attenuate the contributions
of low- and mid-scoring patches. The value of $k = 5$ was
optimized on the validation set and is fixed in our
experiments.
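
Equation (1) amounts to a mean of per-patch scores raised to the power k. The short NumPy sketch below computes it for all classes at once; the array layout (one row per patch, one column per class) and the function name are assumptions of the example.

```python
import numpy as np

def aggregate_scores(patch_scores, k=5):
    """Image-level score per class: mean over patches of y(C_n | P_i) ** k.

    patch_scores has shape (M, N): M patches, N classes.
    """
    scores = np.asarray(patch_scores, dtype=float)
    return np.mean(scores ** k, axis=0)

# Two patches, one class: a confident patch dominates the aggregate.
print(aggregate_scores([[0.5], [1.0]], k=5))
```

With k = 1 the function reduces to a plain average of patch scores, which makes the effect of larger k easy to compare.
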

Note that patch scores could be computed much more
efficiently by performing large convolutions on adequately
subsampled versions of the full image, as described for
instance in [ ]. This would permit a denser patch coverage
at a lower computation cost.

4. Experiments 

In this section we first describe details of training, and 
discuss pre-training results for the source task of ImageNet 
object classification. We next show experimental results of 
the proposed transfer learning method on the target Pascal 
VOC object classification task for both VOC 2007 and VOC 
2012 datasets. We also investigate the dependency of results 
on the overlap of source and target tasks by object classes. 
Finally, we apply the proposed transfer learning method on 
a very different task of action recognition in still images. 

Training convolutional networks. All our training sessions
were carried out using the code provided by
Krizhevsky et al. [ 4] and replicating their exact dropout
and jittering strategies. However, we do not alter the RGB
intensities and we use a single GeForce GTX Titan GPU
with 6GB of memory instead of the two GPUs of earlier
generation used in [ ]. The training procedure periodically
evaluates the cross-entropy objective function on a subset of
the training set and on a validation set. The initial learning
rates are set to 0.01 and the network is trained until the
training cross-entropy stabilizes. The learning rates are then
divided by 10 and the training procedure repeats. We stop
training after three iterations. We have not tuned parameters
for this part of the algorithm and we did not observe
overfitting on the validation set.
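
The step schedule described above can be sketched as follows. The initial rate, the division by 10 and the three repetitions come from the text; reading "three iterations" as three learning-rate stages is an interpretation, and the plateau test (comparing mean loss over two consecutive windows) is a hypothetical stand-in for whatever stabilization criterion was actually used.

```python
def lr_schedule(initial_lr=0.01, factor=10.0, stages=3):
    """Learning rate used in each stage: 0.01, then divided by 10 per stage."""
    return [initial_lr / factor ** i for i in range(stages)]

def has_stabilized(losses, window=5, tol=1e-3):
    """Assumed plateau test: relative change between the mean losses of the
    last two windows of evaluations falls below tol."""
    if len(losses) < 2 * window:
        return False
    recent = sum(losses[-window:]) / window
    previous = sum(losses[-2 * window:-window]) / window
    return abs(previous - recent) <= tol * max(abs(previous), 1e-12)

for lr in lr_schedule():
    # A real run would train at this rate until has_stabilized(...) is True.
    print(lr)
```
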

Image classification on ImageNet. We first train a single
convolutional network on the 1000 classes and 1.2 million
images of the ImageNet 2012 Large Scale Visual Recognition
Challenge (ILSVRC-2012). This network has exactly
the same structure as the network described in [ ].
Layers FC6 and FC7 have 4096 units. Training lasts about one
week. The resulting network achieves an 18% top-5 error
rate¹, comparable to the 17% reported by [ ] for a single
network. This slight performance loss could be caused by the
absence of RGB intensity manipulation in our experiments.

Image classification on Pascal VOC 2007. We apply our
mid-level feature transfer scheme to the Pascal VOC 2007
object classification task. Results are reported in Table 1.
Our transfer technique (PRE-1000C) demonstrates significant
improvements over previous results on this data,
outperforming the 2007 challenge winners [ ] (INRIA) by 18.3%
and the more recent work of [ 6] (NUS-PSL) by 7.2%.

¹ 5 guesses are allowed.


Image classification on Pascal VOC 2012. We next apply
our method to the Pascal VOC 2012 object classification
task. Results are shown in the row PRE-1000C of Table 2.
Although these results are on average about 4% inferior
to those reported by the winners of the 2012 challenge
(NUS-PSL [ ]), our method outperforms [ ] on five out
of twenty classes. To estimate the performance boost
provided by the feature transfer, we compare these results to
the performance of an identical network directly trained on
the Pascal VOC 2012 training data (NO PRETRAIN) without
using any external data from ImageNet. Notably, the
performance drop of nearly 8% in the case of NO PRETRAIN
clearly indicates the positive effect of the proposed transfer.

Transfer learning and source/target class overlap. Our
source ILSVRC-2012 dataset contains target-related object
classes, in particular, 59 species of birds and 120 breeds of
dogs related to the bird and dog classes of Pascal VOC. To
understand the influence of this overlap on our results, we
have pre-trained the network on source task data formed
by 1,000 ImageNet classes selected, this time, at random
among all the 22,000 available ImageNet classes. Results
of this experiment are reported in Table 2, row PRE-1000R.
The overall performance has decreased slightly, indicating
that the overlap between classes in the source and target
domains may have a positive effect on the transfer. Given the
relatively small performance drop, however, we conclude
that our transfer procedure is robust to changes of source
and target classes. As the number of training images in this
experiment was about 25% smaller than in the ILSVRC-2012
training set (PRE-1000C), this could have been another
reason for the decrease of performance.

Conversely, we have augmented the 1,000 classes of the
ILSVRC-2012 training set with 512 additional ImageNet
classes selected to increase the overlap with specific
classes in the Pascal VOC target task. We included all
the ImageNet classes located below the hoofed mammal
(276 classes), furniture (165), motor vehicle (48),
public transport (18), bicycle (5) nodes of the
WordNet hierarchy. In order to accommodate the larger
number of classes, we also increased the size of the FC6 and
FC7 layers from 4,096 to 6,144 dimensions. Training on the
resulting 1.6 million images achieves a 21.8% top-5 error
rate on the 1,512 classes. Using this pre-trained network we
have obtained further improvements on the target task,
outperforming the winner of Pascal VOC 2012 [ ] on average
(row PRE-1512 in Table 2). In particular, improvements
are obtained for categories (cow, horse, sheep, sofa,
chair, table) related to the added classes in the source
task. By comparing results for PRE-1000R, PRE-1000C
and PRE-1512 setups, we also note the consistent improvement
of all target classes. This suggests that the number of
images and classes in the source task might be decisive for
the performance in the target task. Hence, we expect further
improvements by our method using larger source tasks.

Table 1: Per-class results for object classification on the VOC2007 test set (average precision %). Column headers and most rows are missing; the surviving values are listed in their original order.

(unlabelled row) 82.5 54.2 75.0 62.7 41.4 74.6 85.0 76.8 91.1 53.9 83.6 70.6 70.5
PRE-1000C 88.5 81.5 87.9 82.0 47.5 75.5 87.2 61.6 67.3 85.5 80.0

Table 2: Per-class results for object classification on the VOC2012 test set (average precision %). Only a single value, 79.0, survives from the table body.

Method        jump  phon  instr  read  bike  horse  run   phot  comp  walk  mAP
Stanford [ ]  75.7  44.8  66.6   44.4  93.2  94.2   87.6  38.4  70.6  75.6  69.1
Oxford [1]    77.0  50.4  65.3   39.5  94.1  95.9   87.7  42.7  68.6  74.5  69.6
NO PRETRAIN   43.2  30.6  50.2   25.0  76.8  80.7   75.2  22.2  37.9  55.6  49.7
PRE-1512      73.4  44.8  74.8   43.2  92.1  94.3   83.4  45.7  65.5  66.8  68.4
PRE-1512U     74.8  46.0  75.6   45.3  93.5  95.0   86.5  49.3  66.7  69.5  70.2

Table 3: Pascal VOC 2012 action classification results (AP %).


Varying the number of adaptation layers. We have also 
tried to change the number of adaptation layers in the best 
performing PRE-1512 training set-up. Using only one fully 
connected adaptation layer FCb of size 21 (the number of 
categories) results in about 1% drop in performance. Simi- 
larly, increasing the number of adaptation layers to three (of 
sizes 2048, 2048 and 21, respectively) also results in about 
1% drop in classification performance. 

Object localization. Although our method has not been 
explicitly designed for the task of localization, we have 
observed strong evidence of object and action localization 
provided by the network at test time. For qualitative as- 
sessment of localization results, we compute an output map 
for each category by averaging the scores of all the testing 
patches covering a given pixel of the test image. Examples 
of such output maps are given in Figures 1 and 5 as well 
as on the project webpage [ ]. This visualization clearly 
demonstrates that the system knows the size and locations 
of target objects within the image. Addressing the detection 
task seems within reach. 
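
The output map described above can be sketched in a few lines of NumPy: each pixel receives the average score of the patches that cover it. Patches are given as `(x, y, s)` triples (top-left corner and width); rounding patch borders to integer pixels and assigning 0 to pixels that no patch covers are assumptions of this sketch.

```python
import numpy as np

def output_map(height, width, patches, scores):
    """Per-pixel average of the scores of all patches covering that pixel."""
    total = np.zeros((height, width))
    count = np.zeros((height, width))
    for (x, y, s), score in zip(patches, scores):
        x0, y0 = int(round(x)), int(round(y))
        x1, y1 = int(round(x + s)), int(round(y + s))
        total[y0:y1, x0:x1] += score
        count[y0:y1, x0:x1] += 1
    # Divide only where at least one patch contributed; elsewhere keep 0.
    return np.divide(total, count, out=np.zeros_like(total), where=count > 0)

m = output_map(4, 4, [(0, 0, 2), (1, 1, 2)], [1.0, 0.0])
print(m)
```

One such map would be computed per category, using that category's score for every test patch.
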

Action recognition. The Pascal VOC 2012 action recognition
task consists of 4588 training images and 4569 test
images featuring people performing actions among ten
categories such as jumping, phoning, playing instrument
or reading. This fine-grained task differs from the
object classification task because it entails recognizing
fine differences in human poses (e.g. running vs.
walking) and subtle interactions with objects (phoning
or taking photo). Training samples with multiple
simultaneous actions are excluded from our training set.

To evaluate how our transfer method performs on this
very different target task, we use a network pre-trained
on 1512 ImageNet object classes and apply our transfer
methodology to the Pascal VOC action classification task.
Since the bounding box of the person performing the action
is known at testing time, both training and testing are
performed using a single square patch per sample, centered
on the person bounding box. Extracting the patch possibly
involves enlarging the original image by mirroring
pixels. The results are summarized in row PRE-1512 of
Table 3. The transfer method significantly improves over the
NO PRETRAIN baseline where the CNN is trained solely on
the action images from Pascal VOC, without pretraining on
ImageNet. In particular, we obtain best results on challenging
categories playing instrument and taking photo.
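
One way to extract such a square patch is sketched below with NumPy. Taking the square's side as the longer side of the person box and using `np.pad` with `mode="reflect"` for the mirroring are assumptions; the text does not state how the square size or the mirroring were implemented.

```python
import numpy as np

def square_person_patch(image, box):
    """Crop a square patch centered on box, mirroring pixels past the border.

    image: H x W x C array; box: (x0, y0, x1, y1) in integer pixels,
    with x1 and y1 exclusive.
    """
    x0, y0, x1, y1 = box
    side = max(x1 - x0, y1 - y0)          # assumed: longer side of the box
    cx, cy = (x0 + x1) // 2, (y0 + y1) // 2
    left, top = cx - side // 2, cy - side // 2
    right, bottom = left + side, top + side
    h, w = image.shape[:2]
    # Amount by which the square sticks out of the image on each side.
    pad_l, pad_t = max(0, -left), max(0, -top)
    pad_r, pad_b = max(0, right - w), max(0, bottom - h)
    padded = np.pad(image, ((pad_t, pad_b), (pad_l, pad_r), (0, 0)),
                    mode="reflect")
    return padded[top + pad_t:bottom + pad_t, left + pad_l:right + pad_l]

img = np.arange(4 * 6).reshape(4, 6, 1)
patch = square_person_patch(img, (0, 0, 2, 4))
print(patch.shape)
```
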

In order to better adapt the CNN to the subtleties of the
action recognition task, and inspired by [ ], our last
results were obtained by training the target task CNN
without freezing the FC6 weights. More precisely, we copy
the ImageNet-trained weights of layers C1...C5, FC6 and
FC7, we append the adaptation layers FCa and FCb, and we
retrain layers FC6, FC7, FCa, and FCb on the action
recognition data. This strategy increases the performance on all
action categories (row PRE-1512U in Table 3), yielding, to
the best of our knowledge, the best average result published
on the Pascal VOC 2012 action recognition task.

To demonstrate that we can also localize the action in the 
image, we train the network in a sliding window manner, as 
described in Section 3. In particular, we use the ground truth 
person bounding boxes during training, but do not use the 
ground truth person bounding boxes at test time. Example 
output maps shown in Figure 5 clearly demonstrate that the 
network provides an estimate of the action location in the 
image. 

Failure modes. Top-ranked false positives in Figure 5
correspond to samples closely resembling target object
classes. Resolving some of these errors may require
high-level scene interpretation. Our method may also fail to
recognize spatially co-occurring objects (e.g., person on a
chair) since patches with multiple objects are currently
excluded from training. This issue could be addressed by
changing the training objective to allow multiple labels per
sample. Recognition of very small or very large objects
could also fail due to the sparse sampling of patches in our
current implementation. As mentioned in Section 3.3, this
issue could be resolved using a more efficient CNN-based
implementation of sliding windows.

5. Conclusion 

Building on the performance leap achieved by [ ] on
ILSVRC-2012, we have shown how a simple transfer learning
procedure yields state-of-the-art results on challenging
benchmark datasets of much smaller size. We have also
demonstrated the high potential of the mid-level features
extracted from ImageNet-trained CNNs. Although the
performance of this setup increases when we augment the
source task data, using only 12% of the ImageNet corpus
already leads to the best published results on the Pascal VOC
2012 classification and action recognition tasks. Our work
is part of the recent evidence [ , , , ] that convolutional
neural networks provide means to learn rich mid-level
image features transferable to a variety of visual recognition
tasks. The code of our method is available at [ ].